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Abstract 

Particle Metropolis-Hastings enables Bayesian parameter inference in general nonlinear state space 
models (SSMs). However, in many implementations a random walk proposal is used and this can 
result in poor mixing if not tuned correctly using tedious pilot runs. Therefore, we consider a new 
proposal inspired by quasi-Newton algorithms that may achieve similar (or better) mixing with less 
tuning. An advantage compared to other Hessian based proposals, is that it only requires estimates 
of the gradient of the log-posterior. A possible application is parameter inference in the challenging 
class of SSMs with intractable likelihoods. We exemplify this application and the benefits of the new 
proposal by modelling log-returns of future contracts on coffee by a stochastic volatility model with 
a-stable observations. 
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1 Introduction 


We are interested in Bayesian parameter inference in the nonlinear state space model (SSM) possibly 
with an intractable likelihood. An SSM with latent states x’o -t = {%t}t=o and observations y\,T is given 
by 


x t+1 \x t ~ fg(x t+1 \x t ), y t \x t ~ g e (y t |x t ), (1) 

with Xo ~ gg(xo) and where 9 £ 0 C denotes the static unknown parameters. Here, we assume that 
it is possible to simulate from the distributions pg{x o), fg(xt+i\xt) and ge(yt\xt), even if the respective 
densities are unavailable. 

The main object of interest in Bayesian parameter inference is the parameter posterior distribution, 

tt(6») = p(9\y 1:T ) oc pe(yv.T)p{&), (2) 


which is often intractable and cannot be computed in closed form. The problem lies in that the likelihood 
Peivi-.r) = p{vi-.t\Q) cannot be exactly computed. However, it can be estimated by computational 
statistical methods such as sequential Monte Carlo (SMC; Doucet and Jo hansenf 2011). The problem is 
further complicated when gg(yt\xt) cannot be evaluated point-wise, which prohibits direct application of 
SMC. This could be the result of that the density does not exist or that it is computationally prohibitive 
to evaluate. In both cases, we say that the likelihood of the SSM 0 is intractable. 

Recent efforts to develop methods for inference in models with intractable likelihoods have focused 
on approximate Bayesian computations (ABC; Marin et al. 2012). The main idea in ABC is that data 
simulated from the model (using the correct parameters) should be similar to the observed data. This 


idea can easily be incorporated into many existing inference algorithms, see Dean et al. 2014 


An example of this is the ABC version of particle Metropolis-Hastings (PMH-ABC; Jasra, 2015 


Bornn et al. 2014). In this algorithm, the intractable likelihood is replaced with an estimate obtained 


by the ABC version of SMC (SMC-ABC; Jasra et al. 2012). However, the random walk proposal 
often used in PMH(-ABC) can result in problems with poor mixing, which leads to high variance in 
the posterior estimates. The mixing can be improved by pre-conditioning the proposal with a matrix V, 


which typically is chosen as the unknown posterior covariance Sherlock et al. 2015 . However, estimating 
V can be challenging when p is large or the posterior ([2]) is non-isotropic. This typically results in that 
the user needs to run many tedious pilot runs when implementing PMH(-ABC) for parameter inference 
in a new SSM. 


Our main contribution is to adapt a limited-memory BFGS algorithm Nocedal and Wright 2006 


as a proposal in PMH(-ABC). This is based on earlier work by Dahlin et al. 2015a and Zhang and 
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Sutton 2011 . In the former, we discuss how to make use of gradient ascent and Newton-type proposals 


in PMH. The advantages of the new proposal are; (i) good mixing when gradient estimates are accurate, 
(ii) no tedious pilot runs required and (iii) only requires gradients to approximate the local Hessian. 
These advantages are important as direct estimation of gradients using SMC often is simpler than for 
Hessians. Note that this new proposal is useful both with and without the ABC approximation of the 
likelihood. 

To demonstrate these benefits we consider a linear Gaussian state space (LGSS) model, where we 
compare the performance of our proposal with and without ABC. Furthermore, we consider using a 


stochastic volatility model with a-stable log-returns Nolan 2003 to model future contracts on coffee. 


This model is common in the ABC literature as the likelihood is intractable, see Dahlin et al. 2015c 


Jasra 2015 and Yildirnn et al. 2014 


2 Particle Metropolis-Hastings 


A popular approach to estimate the parameter posterior ([2]) is to make use of statistical simulation 


methods. PMH Andrieu et al. 2010 is one such method and it operates by constructing a Markov 


chain, which has the sought posterior as its stationary distribution. As a result, we obtain samples from 
the posterior by simulating the Markov chain to convergence. 

The Markov chain targeting ([2| is constructed by an iterative procedure. During iteration fc, we 
propose a candidate parameter 9' ~ q(0'\8k-i,Uk-i) and an auxiliary variable u' ~ mg'(u') as detailed 
in the following using proposals q and mg’. The candidate is then accepted , i.e. {9k, Uk} 4— {9',u'}, with 
the probability 


t(8',9k-i,u',Uk-i ) = 


n(9'\u’) q(9 k _i\9' ,u') 

Tt{9 k _i\u k -i) q(0'\9 k -i,u k -i) 


(3) 


otherwise the parameter is rejected , i.e. {9 k ,u k } 4— {9 k _i,u k _i}. Here, w(9\u) = Pg(yi-.T\ u )p(9) denotes 
some unbiased estimate of 7 t{9) constructed using u and p(9) denotes the parameter prior distribution. 

PMH can be viewed as a Metropolis-Hastings algorithm in which the intractable likelihood is replaced 
with an unbiased noisy estimate. It is possible to show that this so-called exact approximation results in a 


valid algorithm as discussed by Andrieu and Roberts 2009 . Specifically, the Markov chain generated by 


PMH converges to the desired stationary distribution despite the fact that we are using an approximation 
of the likelihood. It is also possible to show that u can be included into the proposal q , which is necessary 


for including gradients and Hessians when proposing 9' as discussed by Dahlin et al. 2015a 


In Section[4j we discuss how to construct the proposal mg by running an SMC algorithm. In this case, 
the auxiliary variable u is the resulting generated particle system. We obtain PMH-ABC as presented in 
Algorithm[l] when SMC-ABC is used for Step 4. This is a complete procedure for generating K correlated 
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Algorithm 1 Particle Metropolis-Hastings (PMH) 

Inputs: K > 0 (no. MCMC steps), 0o (initial parameters) 
and {q,mg} (proposals). 

Output: {0i, ..., 9k} (approximate samples from the posterior). 

1: Generate u o ~ mg 0 and compute pe 0 {yi-.T\uo). 

2: for k = 1 to K do 
3: Sample 0'~ g(0'|0fc_i,u fc _i). 

4: Sample v! ~ mg' using Algorithm [2] 

5: Compute pg'(yi:T\u') using |Io|). 

6: Sample ui k uniformly over [0,1]. 

7: if Wfe < min{l,a(0 , ,0fe_i,u / ,Ufe_i)} given by then 

8: Accept 0', i.e. {9 k ,u k } «— {9',u'}. 

9: else 

10: Reject 0', i.e. {9 k ,u k } «- {0 fc _i,M fc -i}- 

11: end if 

12: end for 


samples {0 1; ..., 0 K } from © By the ergodic theorem, we can estimate any posterior expectation of an 
integrable test function tp : O —> K. (e.g. the posterior mean) by 


E 


<P(0)|j/l:T 


^ A 

<Pmh 


I 


K 


K - K b 




(4) 


k=K b 


which is a strongly consistent estimator if the Markov chain is ergodic Meyn and Tweedie, 2009 


Here, we discard the first K b samples known as the burn-in , i.e. before the chain reaches stationarity. 
Under geometric mixing conditions, the error of the estimate obeys the central limit theorem (when 
(K — K b ) —> oo) given by 


Vl< - Kb 


‘Pmh 


E [< P ( 0)|01 : t ] 



(5) 


where of? denotes the variance of the estimator. The variance is proportional to the inefficiency factor 
(IF), which describes the mixing of the Markov chain. Hence, we can use IF in the illustrations presented 
in Section [5] to compare the mixing between different proposals. 


3 Proposal for parameters 


To complete Algorithm [TJ we need to specify a proposal q from which we sample 9'. The choice of 
proposal is important as it is one of the factors that influences the mixing of resulting Markov chain. 

is 


The general form of a Gaussian proposal discussed in Dahlin et al. 2015a 


<z(0"|0', v!) = N{9"- m (0', u') , £(0', O), 


( 6 ) 
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Table 1: Different proposals for the PMH algorithm. 


Proposal 

m(0' X) 

E (9',u') 

PMHO 

9' 

7 

P. 

(MO 

UJ 

PMH1 

9'+ f [V- l G(ff\u , j] 

ejr- 1 

PMH2 

9' + §[H(9'\u')\~ 1 g(6'\u') 



where different choices of the mean function p,(9 ', v!) and covariance function T,(9', u r ) results in different 


versions 


of PMH as presented in Table [lj 


3.1 Zeroth and first order proposals (PMHO/1) 

PMHO is referred to as a zero order (or marginal) proposal as it only makes use of the last accepted 
parameter to propose the new parameter. Essentially, this proposal is a Gaussian random walk scaled 
by a positive semi-definite (PSD) preconditioning matrix V -1 . The performance of PMHO is highly 
dependent on V _1 , which is tedious and difficult to estimate as it should be selected as the unknown 
posterior covariance, see 


Sherlock et al. 2015 


Furthermore, it is known that gradient information can be useful to give the proposal a mode-seeking 
behaviour. This can be beneficial both initially to find the mode and for increasing mixing by keeping 
the Markov chain in areas with high posterior probability. This information can be included by making 
use of noisy gradient ascent update, where Q(9'\u') denotes the particle estimate of the gradient of the 
log-posterior given by G(9') = V \og j K{0)\e=e i ■ Again, we scale the step size and the gradient by V -1 
resulting in the PMH1 proposal. 

3.2 Second order proposal (PMH2) 

An alternative is to make use of a noisy Newton update as the proposal by replacing V with 7i(9'\u'), 
which denotes the particle estimate of the negative Hessian of the log-posterior given by 71(9') = 


—V 2 log7r(0)|g = 0/. This results in the second order PMH2 proposal discussed by Dahlin et al. 


2015a 


which relies on accurate estimates of the Hessian but these often require many particles and therefore 
incur a high computational cost. This problem is encountered in e.g. the ABC approximation of the 
a-stable model in (|9j) . 

The new quasi-PMH2 (qPMH2) proposal circumvents this problem by constructing a local approxi¬ 
mation of the Hessian based on a quasi-Newton update, which only makes use of gradient information. 


The update is inspired by the limited-memory BFGS algorithm Nocedal 1980 Nocedal and Wright 
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2006 given by 


B i+i(0') = (! P - PisiaJ)B l x (lp - pigisj) + pisis] 


(7) 


with p t 1 = gjSi, si = 0/(j) - 0m-i) and g t = 5(0/p) |uj(q) - G(9i(i-i) Kp-i))- The update is iterated 
over l £ {1,2,..., M— 1} with 1(1) = k — l, i.e. over the M— 1 previous states of the Markov chain. Hence, 
we refer to M as the memory length of the proposal. We initialise the update with B^ 1 = Pi (g[ gi) -1 I p 
and make use of a PMHO proposal with V = <5 _1 I p for the first M iterations, where 5 > 0 is defined by the 


user. The resulting estimate of the negative Hessian is given by 'H(9'\u') = —Bm(9')- See Appendix|B 
for more details regarding the implementation of the qPMH2 proposal and the complete algorithm in 
Algorithm [3] 

An apparent problem with using a quasi-Newton approximation of the Hessian is that the resulting 


proposal is no longer Markov (resulting in a non-standard MCMC). However, as shown by Zhang and 


Sutton 2011 , it is still possible to obtain a valid algorithm by viewing the chain as an M-dimensional 


Markov chain. In effect, this amounts to using the sample at lag M as the basis for the proposal. Hence, 
we set { 0',u'} = {9k-M,Uk-M} and £2 = 1 in the PMH2 proposal in Table [I] Furthermore, in case of a 
rejection we set {0fe,Mfc} = {0k-M,Uk-M}- We refer to Zhang and Sutton| [2011] for further details. 

The approximate Hessian TL(9'\u r ) has to be PSD to be a valid covariance matrix. This can be 
problematic when the Markov chain is located in areas with low posterior probability or sometimes due 
to noise in the gradient estimates. In our experience, this happens occasionally in the stationary regime, 
i.e. after the burn-in phase. However, when it happens 'H(9'\u') is corrected by the hybrid approach 


discussed by Dahlin et al. 2015aj. 


4 Proposal for auxiliary variables 

To implement qPMH2, we require estimates of the likelihood and the gradient of the log-posterior. These 
are obtained by running SMC-ABC which corresponds to simulating the auxiliary variables u. In this 


section, we show how to estimate the likelihood and its gradient by the fixed-lag (FL; Kitagawa and 


Sato 2001) smoother. 


4.1 SMC-ABC algorithm 


SMC-ABC Jasra et al. 2012 relies on a reformulation of the nonlinear SSM 0 - We start by perturbing 
the observations y t to obtain yi-x by 


Vt = tp(yt) + ez t , z t ~ p € , for t = l ,..., T, 


( 8 ) 
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Algorithm 2 Sequential Monte Carlo with approximate Bayesian computations (SMC-ABC) 

Inputs: jiu (perturbed data), the SSM N £ N (no. particles), e > 0 (tolerance parameter), 
A G [0,T) C N (lag). 

Outputs: p$(yi : T\u), Q(d\u) (est. of likelihood and gradient). 

Note: all operations are carried out over i,j = l,...,N. 


l: Sample Xq ~ pg(xo) lJ e{i’o\xo) and set w ^ = 1/N. 

2: for t = 1 to T do 

3 : Resample the particles by sampling a new ancestor index a^ 1 from a multinomial distribution with 

P(4 l) = j) = w t- 1- 

4 : Propagate the particles by sampling ~ 'Bg[x^\x^f 1 ) and extending the trajectory by x^\ = 

5 : Calculate the particle weights by w[ ^ = hg^(y t , x[^) which by normalisation (over i) gives w[ l \ 

end for 

Estimate pg(y\-T \u) by (10) and Q(9\u) by ©• 


where if) denotes a one-to-one transformation and p e denotes a kernel, e.g. Gaussian or uniform, with 
e as the bandwidth or tolerance parameter. We continue with assuming that there exists some random 
variables v t ~ ve(vt\xt) such that we can generate a sample from g$(yt\xt) by the transformation y t = 
Tg(yt , Xt). An example is the Box-Muller transformation to obtain a Gaussian random variable from two 
uniforms, see Appendix [X} 

To obtain the perturbed SSM, we introduce xj = (xj, vj ) as the new state variable with the dynamics 


X t+ i\x t ~ Eg(x t +i\x t ) = Vg{vt + i\xt+i)fg(x t+ i\xt), (9a) 

and the likelihood is modelled by 

y t \x t ~ h g>e (y t \x t ) = p £ (y t - ip(Tg(x t ))), (9b) 

which follows from the perturbation in ©■ With this reformulation, we can construct SMC-ABC as 
outlined in Algorithm [2j which is a standard SMC algorithm applied to the perturbed model. Note that, 
we do not require any evaluations of the intractable density gg(y t \x t ). Instead, we only simulate from 
this distribution and compare the simulated and observed (perturbed) data by p e . 

The accuracy of the ABC approximation is determined by e, where we recover the original formulation 
in the limit when e —> 0. In practice, this is not possible and we return to study the impact of a non-zero 
e in Section fsTTI 
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4.2 Estimation of the likelihood 


From Section [2j we require an unbiased estimate of the likelihood to compute the acceptance probability 
([3]). This can be achieved by using u generated by SMC-ABC. In this case, the auxiliary variables 
u = i}^_ 0 are the particle system composed of all the particles and their trajectories. The 

resulting likelihood estimator is given by 


P9(yv.T\u) = J] 


N 


N 

£ 


t(») 


( 10 ) 


where the unnormalised particle weights are deterministic functions of u. This is an unbiased 
and ./^-consistent estimator for the likelihood in the perturbed model. However, the perturbation itself 
introduces some bias and additional variance compared with the original unperturbed model Dean et al. 


2014 . The former can result in biased parameter estimates and the latter can result in poor mixing of 


the Markov chain. We return to study the impact on the mixing numerically in Section |5.1 


4.3 Estimation of the gradient of the log-posterior 

We also require estimates of the gradient of the log-posterior given u to implement the proposals in¬ 
troduced in Table [lj In Dahlin et al. 2015a , this is accomplished by using the FL smoother together 
with the Fisher identity. However, this requires accurate evaluations of the gradient of \oggg(y t \xt) with 
respect to 9. As discussed by Yildirim et al. [2014 , we can circumvent this problem by the reformulation 
of the SSM in © if the gradient of rg(xt) can be evaluated. This results in the gradient estimate 


T N 


g(6'\u')=X7 logp(9)\ g e/ + ££^^£ t - 1 ), 


( 11 ) 


t=l i=l 


= V log Ee{xt\xt~i)\ e o, + V log/leftist) I 


)-9’ 


where z^ t denotes the ancestor at time t of particle and z£_ 1:f = ^It}- 

The estimator in © relies on the assumption that the SSM is mixing quickly, which means that past 
states have a diminishing influence on future states and observations. More specifically, we assume that 
P9{xt\yv.T) ~ Pe{ x t\yi-.K t ), with Kt = min{i + A,T} and lag A G [0,T) C N. Note that this estimator 
is biased, but this is compensated for by the accept-reject step in Algorithm [T] and does not effect the 
stationary distribution of the Markov chain. See Dahlin et al. 2015a| for details. 


5 Numerical illustrations 

We evaluate qPMH2 by two illustrations with synthetic and real-world data. In the first model, we 
can evaluate gg(jjt\x t ) exactly in closed-form, which is useful to compare standard PMH and PMH- 
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ABC. In the second model, the likelihood is intractable and therefore only PMH-ABC can be used. See 
Appendix [X] for implementation details. 


5.1 Linear Gaussian SSM 

Consider the following LGSS model 


x t +i\x t 

yt\x t 


M[x t+ i\n + (j>(x t - It), , 
Af(y t ;x t , o.i 2 ), 


(12a) 

(12b) 


with 9 = {/x, <f), a v } and p £ R, </> £ (—1,1) and a v £ K+. A synthetic data set consisting of a realisation 
with T = 250 observations is simulated from the model using the parameters {0.2,0.8,1.0}. We begin 
by investigating the accuracy of Algorithm [2] for estimating the log-likelihood and the gradients of the 
log-posterior with respect to 9. The error of these estimates are computed by comparing with the true 
values obtain by a Kalman smoother. 

In Figure[I] we present the log-Li error of the log-likelihood and the gradients for different values of e. 
The error in the gradient with respect to <j> is not presented here, but is similar to the gradients for p. We 
see that the error in both the log-likelihood and the gradient are minimized when e ss 0.05 — 0.10. Here, 
SMC-ABC achieve almost the same error as standard SMC. However, when e grows larger SMC-ABC 
suffers from an increasing bias resulting from a deteriorating approximation in ([8]). We conclude that 
this results in a bias in the parameter estimates as the bias in the log-likelihood estimate propagates to 
the parameter posterior estimate. 


We now consider estimating the parameters in (12). In this model, we can compare standard PMH 
with PMH-ABC to study the impact using ABC to approximate the log-likelihood. For this comparison, 
we quantify the mixing of the Markov chain using the estimated IF given by 


IF (0K b :K) = 1 + 2 T ^Pl(0K b :K), 


(13) 


1=1 


where pi{0K b :K) denotes the empirical autocorrelation at lag l of 9fc b -.Kj and Kb is the burn-in time. A 
small value of IF indicates that we obtain many uncorrelated samples from the target distribution. This 
implies that the chain is mixing well and that cr 2 in ([5]) is rather small. We make use of two different 
L to compare the mixing: (i) the smallest L such that \pl(9 K b -.K)\ < 2/VK — Kb (i.e. when statistical 
significant is lost) and (ii) a fixed L = 1, 000. 

In Table |2] we present the minimum and maximum IFs as the median and interquartile range (IQR) 
computed using 10 Monte Carlo runs over the same data set. We note the good performance of the 
qPMH2 proposal which achieves similar (or better) mixing compared to the pre-conditioned proposals. 
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Figure 1: The log-Li-error in the log-likelihood (top) and gradient with respect to [i (lower left) and 
a v (lower right) using SMC-ABC when varying e. The shaded area and the dotted lines indicate maxi¬ 
mum/minimum error and the standard SMC error, respectively. The plots are created from the output 
of 100 Monte Carlo runs on a single synthetic data set. 
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Table 2: The IF values for the LGSS model (12). 



Alg. 

Acc. 

min 

IF 

max IF 

Median 

IQR 

Median 

IQR 


PMHO 

0.28 

12.13 

1.53 

13.71 

1.23 


PMH1 

0.78 

11.28 

0.50 

14.50 

1.45 

<P 

Oh 

qPMH2 

0.55 

3.00 

0.03 

3.01 

0.07 

c3 

PMHO-ABC 

0.14 

29.66 

10.37 

34.04 

9.36 


PMH1-ABC 

0.31 

33.09 

5.45 

38.32 

14.42 


qPMH2-ABC 

0.45 

3.00 

0.02 

3.03 

0.08 


PMHO 

0.28 

7.96 

9.60 

10.92 

6.00 


PMH1 

0.78 

9.47 

4.39 

10.60 

8.19 

O 

O 

qPMH2 

0.55 

5.40 

3.01 

8.98 

6.90 

|| 

PMHO-ABC 

0.14 

12.91 

8.86 

35.68 

3.22 

>-4 

PMH1-ABC 

0.31 

27.34 

23.63 

35.84 

30.31 


qPMH2-ABC 

0.45 

6.65 

5.14 

10.96 

6.71 


Remember that PMHO/1 are tuned to their optimal performance using tedious pilot runs, see Sherlock 
et al. I(? 015 ] and |Nemeth et alT| 12014| . We present some additional diagnostic plots for PMHO and qPMH2 
(without ABC) in Appendix [Cj 

In our experience, the performance of qPMH2 seems to be connected with the variance of Q(6'\u'). 


This is similar to the existing theoretical results for PMH1 Nemeth et al. 2014 . For the LGSS model, 


we can compute the gradient with a small variance and therefore qPMH2 performs well. However, this 
might not be the case for all models and the performance of qPMH2 thus depends on both the model 
and which particle smoother is applied. 

Finally, note that using ABC results in a smaller acceptance rate and worse mixing. This problem 


can probably be mitigated by adjusting e and N as discussed by Bornn et al. 2014 and Jasra 2015 


5.2 Modelling the volatility in coffee futures 

Consider the problem of modelling the volatility of the log-returns of future contracts on coffee using the 
T = 399 observations in Figure [2j A prominent feature in this type of data is jumps (present around 
March, 2014). These are typically the result of sudden changes in the market due to e.g. news arrivals 
and it has been proposed to make use of an a-stable distribution to model these jumps, see e.g.|Lombardi| 


and Calzolari 2009 and Dahlin et al. 2015c 


Therefore, we consider a stochastic volatility model with symmetric a-stable returns (aSV) given by 


x t +i\x t 

yt\x t 


M[x t+ i,n + (j>{x t - //), a^j , 
A(y t ;a,exp(x t j^, 
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(14a) 

(14b) 








































daily log-returns 


theoretical Gaussian quantiles 


Figure 2: Upper: log-returns (green) and smoothed log-volatility (orange) of futures on coffee between 
June 1, 2013 and December 31, 2014. The shaded area indicate the approximate 95% credibility region 
for the log-returns estimated from the model. Lower left: kernel density estimate (green) and a Gaussian 
approximation (magenta). Lower right: QQ-plot comparing the quantiles of the data with the best 
Gaussian approximation (magenta). 
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Figure 3: Parameter posteriors for (14) for /i (green), (orange), a v (purple) and a (magenta) obtained 
by pooling the output from 10 runs using qPMH2. The dashed lines indicate the corresponding estimates 
from PMHO. Dotted lines and grey densities indicate the estimate of the posterior means and the prior 
densities, respectively. 
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with 6 = {/j,,<j),cr Vi a}. Here, A(a,r]) denotes a symmetric a-stable distribution with stability parameter 
a £ (0,2) and scale parameter rj £ R + . As previously discussed, we cannot evaluate ge{yt\xt) for this 
model but it is possible to simulate from it. See Appendices [A] and [D] for more details regarding the 
a-stable distribution and methods for simulating random variables. 

In Figure [3j we present the posterior estimates obtained from qPMH2, which corresponds to the 
estimated parameter posterior mean 6 = {0.214, 0.931, 0.268,1.538}. This indicates a slowly varying log- 
volatility with heavy-tailed log-returns as the Cauchy and Gaussian distribution corresponds to a = 1 
and a = 2, respectively. We also compare with the posterior estimates from PMHO and conclude that 
they are similar in both location and scale. 

Finally, we present the smoothed estimate of the log-volatility using 6 in Figure [ 2 ] The estimate 
seems reasonable and tracks the periods with low volatility (around October, 2013) and high volatility 
(around May, 2014). 


6 Conclusions 


We have demonstrated that qPMH2 exhibits similar or improved mixing when compared with pre¬ 
conditioned proposals with/without the ABC approximation. The main advantage is that qPMH2 does 
not require extensive tuning of the step sizes in the proposal to achieve good mixing, which can be a 
problem in practice for PMH0/1. The user only needs to choose 6 and M, which in our experience are 
simpler to tune. Finally, qPMH2 only requires gradient information, which are usually simpler to obtain 
than directly estimate the Hessian of the log-posterior. 

In future work, it would be interesting to analyse the impact of the variance of the gradient estimates 
on the mixing of the Markov chain in qPMH2. In our experience, qPMH2 performs well when the variance 
is small or moderate. However when the variance increases, the mixing can be worse than for PMHO. 
This motivates further theoretical study and development of better particle smoothing techniques for 
gradient estimation to obtain gradient estimates with lower variance. 

Another extension of this work is to consider models where p is considerable larger than discussed 
in this paper. This would probably lead to an even greater increase in mixing when using qPMH2 


compared with PMHO. This effect is theoretically and empirically examined by Nemeth et al. 2014 . In 


this context, it would also be interesting to implement a quasi-Newton proposal in a particle Hamiltonian 


Monte Carlo (HMC) algorithm in the spirit of Zhang and Sutton 2011 . This as HMC algorithms are 
known to greatly increase mixing for some models when the gradient and log-likelihood can be evaluated 
analytically. However, no particle version of HMC has yet been proposed in the literature but the 


possibility is considered in the discussions following Girolami and Calderhead 2011 . 
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The source code and data for the LGSS model as well as some supplementary material are available 


from https://github.com/compops/qpmh2-sysid2015/ 
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A Implementation details 


In Algorithm [2] for the LGSS model, we use a fully adapted SMC algorithm with N = 50 and SMC-ABC 
with N = 2, 500, lag A = 12, e = 0.10 and p e as the Gaussian density with standard deviation e. For the 
aSV, we use the same settings expect for N = 5, 000. For qPMH2, we use the memory length M = 100, 
6 = 1,000 and ?rh y b = 2,500 samples for the hybrid method. We use K = 15,000 iterations (discarding 
the first Kg = 5,000 as burn-in) for all PMH algorithms and initialise in the maximum a posteriori 


(MAP) estimate obtained accordingly to Dahlin et al. 2015c 


The pre-conditioning matrix V is estimated by pilot runs using PMH0, with step sizes based on the 
Hessian estimate obtained in the MAP estimation. The final step sizes are given by the rules of thumb 

i.e. 


by Sherlock et al. 2015 and Nemeth et al. 2014 


el = 2.562 2 p~\ el = 1.125 2 p~ 1/3 . 


Finally, we use the following prior densities 


p(p) ~TM ( o,d(m; 0,0.2 2 ), p^) ~ TW(_ 1( !)(<(-; 0.9,0.05 2 ), 

p(cr v ) ~ G(cr v ; 0 . 2 , 0 . 2 ), p(a) ~ B{a/ 2 ; 6 , 2 ), 


where TAf( a ,b)(') denotes a truncated Gaussian distribution on [a, 6 ], G(a,b) denotes the Gamma distri¬ 
bution with mean a/b and B(a, b) denotes the Beta distribution. 

For SMC-ABC, we require the transformation r(i t ) to simulate random variables from the two mod¬ 
els. For the LGSS model, we use the identity transformation = x an d the Box-Muller transformation 
to simulate yt by 


Tg(x t ) = x t + ex e \J 2 log v tA cos(27ru tj2 ), 


where {u M ,t^ 2 } ~ U\ 0,1]. 

For the aSV model, we use = arctan(x) as proposed by |Yildmm et al. 2014 to make the 
variance in the gradient estimate finite. We generate samples from A(a, exp(* t )) for a 7 ^ 1 by 


Tg(x t ) = exp(a^/2) 


sm(avt t 2) 

cos^t ^)] 1 / 0 


cos [(a - 1 )v t p\ 


v t , 1 


where {v tl i,v t , 2 } ~ {Exp(l), U(— 7 r/ 2 ,7t/ 2)}. The real-world data in the aSV model is computed as 
yt = 100 [log(s t ) — log(st-i)], where s t denotes the price of a future contract on coffee obtained from 

https://www.quandl.com/CHRIS/ICE_KC2 
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B Implementation details for quasi-Newton proposal 


The quasi-Newton proposal adapts the Hessian estimate at iteration k using the previous M states of the 
Markov chain. Hence, the current proposed parameter depends on the previous M states and therefore 


this can be seen as a Markov chain of order M. Following Zhang and Sutton 2011 , we can analyse this 
chain as a first-order Markov chain on an extended M-fold product space Q AI . This results in that the 
stationary distribution can be written as 


M 

:M ) = ]^[ 7r(6»i), 
i =1 

where 7 x(6k,i) is defined as in ©■ To proceed with the analysis, we introduce the notation ipk = {Ok, u k } 
and V’fcjiiAAi f° r the vector ipkA-.M with ipk,i removed for brevity. We can then update a component of 
ip k ,i-.M using some transition kernel 1) defined by 


where B (rpi, tpi\ipk,i:M\i ) denotes the quasi-Newton PMH2 proposal adapted using the last M— 1 samples 
tpk,i:M\i, analogously to 0. This is similar to a Gibbs-type step in which we update a component 
conditional on the remaining M — 1 components. Here, this conditioning is used to construct the local 
Hessian approximation using the information in the remaining components. 

As this update leaves 7r(0k,i) invariant, it follows that 


T(lpk,l:Mi1Pk,l:M) = Ti O T 2 O • • • O T M , V4,1:m) ) 


leaves ir{0k,i:M) invariant. Hence, we can update all components of the M-dimensional Markov chain 
during each iteration k of the sampler. 

To implement the algorithm, we instead interpret the proposal as depending on the last M — 1 states 
of the Markov chain. This results in a sliding window of samples, which we use to adapt the proposal. 
This results in two minor differences from a standard PMH algorithm. The first is that we center the 
proposal around the position of the Markov chain at k — M, i.e. 


q{0'\%pk-M+i.k- 1 )=N(o'-e k -M, SBFGs(V’fc-M+l:fc-l )), (15) 

where EBFGs(V’fc-M+i:fc-i) = ~Bm{0') obtain by iterating Q. The second difference is that we set 
{Ok, Uk} 4— {Ok-M, Uk-M} is the candidate parameter O' is rejected. The complete procedure for propos¬ 
ing from q(0' k \ipk-M+i-.k-i) is presented in Algorithm [3j 


Some possible extensions to this noisy quasi-Newton inspired update are discussed by |Schraudolph| 









Algorithm 3 Quasi-Newton proposal 

Inputs: {Ok-M:k-ii u k-M:k-i} (last M states of the Markov chain), <5 > 0 (initial Hessian). 
Outputs: O' (proposed parameter). 


1 : 

2 : 

3 : 

4 : 

5 : 

6 : 

7 : 

8 : 

9 : 

10 : 

11 : 

12 : 


Extract the M* unique elements from {Ok-M+i-.k-i, Wfc-M+i:fc-i} and sort them in ascending order 
(with respect to the log-likelihood) to obtain {0*,u*}. 

if M* > 2 then 

Calculate Si and yi for l = 1,..., M — 2 using the ordered pairs in {0*, «*} and their corresponding 
gradient estimates. 

Initialise the Hessian estimate B C 1 = Pi (yjyi) -1 ^- 
for l = 1 to M * — 1 do 

Carry out the update <[tJ) to obtain B^ + [ . 

end for 

Set £ B FGs(^fc-M+l:fc-l) = -B M *{0'). 

else 


Set E B FGs(V'fe-M+i:fc-i) = <5I P - 

end if 

Sample from (15) to obtain O'. 


et al. 2007 and Zhang and Sutton 2011 . These include how to rearrange the update such that the 


Hessian estimate always is a PSD matrix. In this paper, we instead make use of the hybrid method 


discussed by Dahlin et al.| 2015b to handle these situations. In Schraudolph et al. 2007 , the authors 
also discuss possible alternations to handle noisy gradients, which could be useful in some situations. In 
our experience, these alternations do not always improve the quality of the Hessian estimate as the noisy 
is stochastic and not the result of a growing data set as in the aforementioned paper. 


C Additional results 

In this section, we present some additional plots for the LGSS example in Section |5.1| using PMH0 in 
Figure [4] and qPMH2 in Figure [5] 

We note that the mixing is better for qPMH2 as the ACF decreases quicker as the lag increased 
compared with PMH0. However due to the quasi-Newton proposal, we obtain a large correlation coef¬ 
ficient at lag M. This is also reflected in the trace plots, which exhibits a periodic behaviour. Hence, 
we conclude that the mixing is increased in some sense when using qPMH2 compared with PMH0. The 
exact magnitude of this improvement depends on the method to compute the IF values. 


D ct-stable distributions 


This appendix summarises some important results regarding a-stable distributions. For a more detailed 


presentation, see Nolan 2003 , Peters et al. 2012 and references therein. 
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Figure 4: Trace plots and ACF estimates (left) for fi (green), </> (orange) and a v (purple) in the LGSS 
model using pPMHO-ABC. The resulting posterior estimates (right) are also presented as histograms and 
kernel density estimates. The dotted vertical lines in the estimates of the posterior indicate its estimated 
mean. The grey lines indicate the parameter prior distributions. 
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Figure 5: Trace plots and ACF estimates (left) for fi (green), 0 (orange) and a v (purple) in the LGSS 
model using qPMH2-ABC. The resulting posterior estimates (right) are also presented as histograms and 
kernel density estimates. The dotted vertical lines in the estimates of the posterior indicate its estimated 
mean. The grey lines indicate the parameter prior distributions. 
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Definition 1 (a-stable distribution Nolan, 2003 ) An univariate a-stable distribution denoted by 
A(a , j3, c, p) has the characteristic function 


a, /3, c, p) = 


exp {itp — |cf|“ [l — i/3 tan (™) sgn(t )] } if a ^ 1, 
exp — |ct|“ 1 + ^-sgn{t) In |t| | if a = 1, 


where a £ [0,2] denotes the stability parameter, /3 £ [—1,1] denotes the skewness parameters, c £ R+ 
denotes the scale parameter and p £ ffi. denotes the location parameter. 

The a-stable distribution is typically defined through its characteristic function given in Defini¬ 
tion [I] Except for some special cases, we cannot recover the probability distribution function (pdf) from 
cj)(t; a, /3, c, p) as it cannot be computed analytically. These exceptions are; (i) the Gaussian distribution 
Af(p, 2c 2 ) is recovered when a = 2 for any /3 (as the a-stable distribution is symmetric for this choice of 
a), (ii) the Cauchy distribution C(p, c) is recovered when a = 1 and (3 = 0, and (iii) the Levy distribution 
C(j /, c) is recovered when a = 0.5 and /3 = 1. 

For all other choices of a and f3, we cannot recover the pdf from <[>(t; a , /3, c, p) due to the analytical 
intractability of the Fourier transform. However, we can often approximate the pdf using numerical 


methods as discussed in Peters et al. 2012 . In this work, we make use of another approach based on 


ABC approximations to circumvent the intractability of the pdf. For this, we require to be able to 


sample from A(a, /3, c, p) for any parameters. An procedure for this discussed by Chambers et al. 1976 
is presented in Proposition [T] 


Proposition 1 (Simulating a-stable variable [Chambers et al., 1976|) Assume that we can sim¬ 
ulate w ~ Exp(l) and u ~ U{—tt/ 2, n/2). Then, we can obtain a sample from A{a,f3, 1,0) by 


sin[q(-u+T c , i)8 )] 
(cos(aT, v ; 3 ) cos(u)) 1 ' /a 


cos [aT a t p + (a—1 ) u] 


y = 


- f3u) tan(w) - /? log 


*/«/ 1 , 

if oi = 1, 


(16) 


where we have introduced the following notation 


T a 1 3 = — arctan 
a 


0 3 


tan 


/7ra\ 

V~2 ~) 


A sample from A(a, /3,c, p) is obtained by the transformation 


f cy + p ifot^f 1, 

cy + (p + /3fclogc) ifa= 1. 
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